INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE 






Quasi-stationary distributions 
as centrality measures of reducible graphs 

Konstantin Avrachenkov — Vivek Borkar — Danil Nemirovsky 






N° 6263 

August 2007 
Theme COM




SOPHIA ANTIPOLIS 




Theme COM: Communicating Systems
Project-team MAESTRO

Research Report No. 6263, August 2007



Abstract: A random walk can be used as a centrality measure of a directed graph. However, if
the graph is reducible, the random walk will be absorbed in some subset of nodes and will never
visit the rest of the graph. In Google PageRank this problem was solved by introducing uniform
random jumps with some probability. Up to the present, there is no clear criterion for the choice
of this parameter. We propose to use a parameter-free centrality measure based on the notion
of quasi-stationary distribution. Specifically, we suggest four quasi-stationarity based centrality
measures, analyze them, and conclude that they produce approximately the same ranking. The
new centrality measures can be applied in spam detection to detect "link farms" and in image
search to find photo albums.

Keywords: centrality measure, random walk, directed graph, quasi-stationary distribution,
PageRank, Web graph, link farm

1 Introduction

A random walk can be used as a centrality measure of a directed graph. An example of a random
walk based centrality measure is PageRank [21], which the search engine Google uses to sort the
relevant answers to a user's query. We shall follow the formal definition of PageRank from [18].
Denote by $n$ the total number of pages on the Web and define the $n \times n$ hyperlink matrix
$P$ such that

$$
P_{ij} =
\begin{cases}
1/d_i, & \text{if page } i \text{ links to } j, \\
1/n, & \text{if page } i \text{ is dangling}, \\
0, & \text{otherwise},
\end{cases}
\qquad (1)
$$

for $i, j = 1, \dots, n$, where $d_i$ is the number of outgoing links from page $i$. A page with no
outgoing links is called dangling. We note that according to (1) there exist artificial links to all
pages from a dangling node. In order to make the hyperlink graph connected, it is assumed that at
each step, with probability $1 - c$, a random surfer goes to an arbitrary Web page sampled from the
uniform distribution. Thus, PageRank is defined as the stationary distribution of a Markov chain
whose state space is the set of all Web pages and whose transition matrix is

$$
G = cP + (1 - c)(1/n)E,
$$

where $E$ is the matrix whose entries are all equal to one, and $c \in (0, 1)$ is the probability of
following a hyperlink. The constant $c$ is often referred to as the damping factor. The Google
matrix $G$ is stochastic, aperiodic, and irreducible, so the PageRank vector $\pi$ is the unique
solution of the system

$$
\pi G = \pi, \qquad \pi \mathbf{1} = 1,
$$

where $\mathbf{1}$ is a column vector of ones.
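
As an illustration of the definitions above, the following Python sketch builds the hyperlink matrix of equation (1) for a small hypothetical link structure and approximates the PageRank vector by power iteration on the Google matrix. The four-page graph, the function names `hyperlink_matrix` and `pagerank`, and the stopping tolerance are assumptions made for this example only; the value 0.85 is simply a commonly used damping factor.

```python
def hyperlink_matrix(links, n):
    """Build P as in equation (1); links[i] is the set of pages that page i links to."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        targets = links.get(i, set())
        if not targets:
            # Dangling page: artificial links to all pages, each with weight 1/n.
            P[i] = [1.0 / n] * n
        else:
            for j in targets:
                P[i][j] = 1.0 / len(targets)
    return P


def pagerank(P, c=0.85, tol=1e-12, max_iter=10_000):
    """Power iteration for pi = pi G, where G = c P + (1 - c)(1/n) E."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # Because pi sums to one, pi E is the all-ones row vector.
        new = [(1.0 - c) / n + c * sum(pi[i] * P[i][j] for i in range(n))
               for j in range(n)]
        step = sum(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if step < tol:
            break
    return pi


# Hypothetical 4-page example: page 3 has no outgoing links (dangling).
links = {0: {1, 2}, 1: {2}, 2: {0, 3}}
P = hyperlink_matrix(links, 4)
pi = pagerank(P, c=0.85)
```

The iteration never forms $G$ explicitly: the uniform-jump term contributes the same constant $(1 - c)/n$ to every component, which is what makes this approach practical for large sparse graphs.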

Even though the choice of the damping factor $c$ has been discussed in a number of recent works,
there is still no clear criterion for the choice of its value. The goal of the present work is to
explore parameter-free centrality measures.

In [7, 15] the authors have studied the graph structure of the Web. In particular, in [15]
it was shown that the Web graph can be divided into three principal components: the Giant
Strongly Connected Component, to which we simply refer as the SCC component, the IN component,
and the OUT component. The SCC component is the largest strongly connected component in the
Web graph. In fact, it is larger than the second largest strongly connected component by several
orders of magnitude. Following hyperlinks, one can go from the IN component to the SCC
component, but it is not possible to return. Similarly, from the SCC component one can reach
the OUT component, but it is not possible to return to the SCC from the OUT component. In [7, 15]
the analysis of the structure of the Web was made assuming that dangling nodes have no outgoing
links. However, according to (1) there is a probability to jump from a dangling node to an arbitrary
node. This can be viewed as a link between the nodes, and we call such a link an artificial link.
As was shown in [5], these artificial links significantly change the graph structure of the Web. In
particular, the artificial links of dangling nodes in the OUT component connect some parts of the
OUT component with the IN and SCC components. Thus, the size of the Giant Strongly Connected
Component increases further. If the artificial links from dangling nodes are taken into account, it
is shown in [5] that the Web graph can be divided into two disjoint components: the Extended
Strongly Connected Component (ESCC) and the Pure OUT (POUT) component. The POUT component is
small in size, but if the damping factor $c$ is chosen equal to one, the random walk is absorbed
with probability one into POUT. We note that nearly all important pages are in ESCC. We also note
that even if the damping factor is chosen close to one, the random walk can spend a significant
amount of time in ESCC before absorption. Therefore, for ranking Web pages from ESCC we
suggest using quasi-stationary distributions [9, 22].

It turns out that there are several versions of the quasi-stationary distribution. Here we study
four of them. Our main conclusion is that the rankings they provide are very similar. Therefore,
one can choose the version of the quasi-stationary distribution that is easiest to compute.

The paper is organized as follows. In Section 2 we discuss different notions of quasi-stationarity,
the relations among them, and the relation between the quasi-stationary distribution
and PageRank. Then, in Section 3 we present the results of numerical experiments on the Web graph,
which confirm our theoretical findings and suggest the application of quasi-stationarity based
centrality measures to link spam detection and image search. Some technical results are placed in
the Appendix.



2 Quasi-stationary distributions as centrality measures 

As noted in [5], by renumbering the nodes the transition matrix $P$ can be transformed to the
following form

$$
P = \begin{bmatrix} Q & 0 \\ R & T \end{bmatrix},
$$

where the block $T$ corresponds to the ESCC, the block $Q$ corresponds to the part of the OUT
component without dangling nodes and their predecessors, and the block $R$ corresponds to the
transitions from ESCC to the nodes in block $Q$. We refer to the set of nodes in the block $Q$ as
the POUT component.

The POUT component is small in size, but if the damping factor $c$ is chosen equal to one, the
random walk is absorbed with probability one into POUT. We are mostly interested in the nodes
in the ESCC component. Denote by $\pi_Q$ the part of the PageRank vector corresponding to the
POUT component and by $\pi_T$ the part of the PageRank vector corresponding to the ESCC
component. Using the following formula [20]

$$
\pi(c) = \frac{1 - c}{n}\, \mathbf{1}^{\top} [I - cP]^{-1},
$$

we conclude that

$$
\pi_T(c) = \frac{1 - c}{n}\, \mathbf{1}^{\top} [I - cT]^{-1},
$$

where $\mathbf{1}$ is a vector of ones of appropriate dimension.
Let us define

$$
\hat{\pi}_T(c) = \frac{\pi_T(c)}{\|\pi_T(c)\|_1}.
$$

Since the matrix $T$ is substochastic, we have the next result.

Proposition 1 The following limit exists

$$
\hat{\pi}_T(1) = \lim_{c \to 1} \frac{\pi_T(c)}{\|\pi_T(c)\|_1}
= \frac{\mathbf{1}^{\top} [I - T]^{-1}}{\mathbf{1}^{\top} [I - T]^{-1} \mathbf{1}},
$$

and the ranking of pages in ESCC provided by the PageRank vector converges to the ranking
provided by $\hat{\pi}_T(1)$ as the damping factor goes to one. Moreover, these two rankings
coincide for all values of $c$ above some value $c^*$.

Next we denote $\hat{\pi}_T(1)$ simply by $\hat{\pi}_T$. Following [9, 22] we shall call the vector
$\hat{\pi}_T$ the pseudo-stationary distribution. The $i$-th component of $\hat{\pi}_T$ can be
interpreted as the fraction of time the random walk (with $c = 1$) spends in node $i$ prior to
absorption. We recall that the random walk as defined in the Introduction starts from the uniform
distribution. If the random walk were initiated from another distribution, the pseudo-stationary
distribution would change.
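
The pseudo-stationary distribution can be approximated without forming a matrix inverse. The sketch below is one possible approach, not the procedure used in the experiments: since the spectral radius of the substochastic block is below one, the row vector $\mathbf{1}^{\top}[I - T]^{-1}$ equals the Neumann series $\sum_{k \ge 0} \mathbf{1}^{\top} T^k$, which the code accumulates iteratively and then normalizes. The three-node matrix `T` and the function name `pseudo_stationary` are hypothetical.

```python
def pseudo_stationary(T, tol=1e-13, max_iter=100_000):
    """Normalised row vector 1^T [I - T]^{-1}, accumulated as a Neumann series."""
    n = len(T)
    x = [1.0] * n                      # partial sum, starting with the k = 0 term
    for _ in range(max_iter):
        # x <- 1^T + x T ; the fixed point satisfies x (I - T) = 1^T.
        new = [1.0 + sum(x[i] * T[i][j] for i in range(n)) for j in range(n)]
        step = max(abs(a - b) for a, b in zip(new, x))
        x = new
        if step < tol * max(x):
            break
    total = sum(x)
    return [v / total for v in x]


# Hypothetical 3-node ESCC block: node 1 leaks probability 0.1 to POUT.
T = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.4],
     [0.5, 0.5, 0.0]]
pi_hat = pseudo_stationary(T)
```

The number of iterations grows as the Perron-Frobenius eigenvalue of the block approaches one, so on a real Web graph a linear solver would probably be preferable to this plain summation.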

Denote by $\bar{T}$ the hyperlink matrix associated with ESCC when the links leading outside of
ESCC are neglected. Clearly, we have

$$
\bar{T}_{ij} = \frac{T_{ij}}{[T\mathbf{1}]_i},
$$

where $[T\mathbf{1}]_i$ denotes the $i$-th component of the vector $T\mathbf{1}$. In other words,
$[T\mathbf{1}]_i$ is the sum of the elements in row $i$ of the matrix $T$. The entry $\bar{T}_{ij}$
of the matrix $\bar{T}$ can be considered as the conditional probability to jump from node $i$ to
node $j$ under the condition that the random walk does not leave ESCC at the jump. Let
$\bar{\pi}_T$ be the stationary distribution of $\bar{T}$.
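
A minimal sketch of this construction, under the assumption that every row of the ESCC block has a positive sum (which holds when the block is strongly connected): the rows are renormalized and the stationary distribution of the resulting stochastic matrix is found by power iteration. The example matrix and the function names are hypothetical, and the iteration assumes an aperiodic chain.

```python
def conditioned_matrix(T):
    """Row-normalise T, i.e. divide row i by its sum [T 1]_i."""
    out = []
    for row in T:
        s = sum(row)                   # [T 1]_i, assumed to be positive
        out.append([t / s for t in row])
    return out


def stationary_distribution(M, tol=1e-13, max_iter=100_000):
    """Power iteration pi <- pi M for an irreducible, aperiodic stochastic matrix M."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
        step = max(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if step < tol:
            break
    return pi


# Hypothetical 3-node ESCC block: node 1 leaks probability 0.1 to POUT.
T = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.4],
     [0.5, 0.5, 0.0]]
T_bar = conditioned_matrix(T)
pi_bar = stationary_distribution(T_bar)
```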

Let us now consider the substochastic matrix $T$ as a perturbation of the stochastic matrix
$\bar{T}$. We introduce the perturbation term

$$
\varepsilon D = \bar{T} - T,
$$

where $\varepsilon$ is the perturbation parameter, which is typically small. The following
result holds.

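Proposition 2 The vector $\hat{\pi}_T$ is close to $\bar{\pi}_T$. Namely,

$$
\hat{\pi}_T = \bar{\pi}_T
- \bar{\pi}_T \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1}
+ \mathbf{1}^{\top} X_0 \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1}) + o(\varepsilon),
\qquad (2)
$$

where $n_T$ is the number of nodes in ESCC and $X_0$ is given in Lemma 1 from the Appendix.

Proof: We substitute $T = \bar{T} - \varepsilon D$ into $[I - T]^{-1}$ and use Lemma 1 to get

$$
[I - T]^{-1} = \frac{1}{\varepsilon} X_{-1} + X_0 + O(\varepsilon).
$$

Using the above expression, we can write

$$
\begin{aligned}
\hat{\pi}_T = \frac{\mathbf{1}^{\top} [I - T]^{-1}}{\mathbf{1}^{\top} [I - T]^{-1} \mathbf{1}}
&= \frac{\dfrac{n_T}{\bar{\pi}_T \varepsilon D \mathbf{1}}\, \bar{\pi}_T + \mathbf{1}^{\top} X_0 + O(\varepsilon)}
        {\dfrac{n_T}{\bar{\pi}_T \varepsilon D \mathbf{1}} + \mathbf{1}^{\top} X_0 \mathbf{1} + O(\varepsilon)} \\
&= \frac{\bar{\pi}_T + \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1})\, \mathbf{1}^{\top} X_0 + o(\varepsilon)}
        {1 + \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1} + o(\varepsilon)} \\
&= \Big( \bar{\pi}_T + \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1})\, \mathbf{1}^{\top} X_0 + o(\varepsilon) \Big)
   \Big( 1 - \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1} + o(\varepsilon) \Big) \\
&= \bar{\pi}_T - \bar{\pi}_T \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1}
   + \mathbf{1}^{\top} X_0 \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1}) + o(\varepsilon).
\end{aligned}
$$

□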

Since $R\mathbf{1} + T\mathbf{1} = \mathbf{1}$ and $\bar{T}\mathbf{1} = \mathbf{1}$, in lieu of
$\bar{\pi}_T \varepsilon D \mathbf{1}$ we can write $\bar{\pi}_T R \mathbf{1}$. The latter expression
has a clear probabilistic interpretation: it is the probability to exit ESCC in one step starting
from the distribution $\bar{\pi}_T$. Later we shall demonstrate that this probability is indeed
small. We note that not only is $\bar{\pi}_T R \mathbf{1}$ small, but the factor $1/n_T$ is small
as well, since the number of states in ESCC is large.

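In Proposition 3 we provide an alternative expression for the first-order terms of $\hat{\pi}_T$.

Proposition 3

$$
\hat{\pi}_T = \bar{\pi}_T - \varepsilon \bar{\pi}_T D H
+ \varepsilon \mathbf{1}^{\top} \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1}) H + o(\varepsilon).
$$

Proof: Let us consider $\hat{\pi}_T$ as a power series:

$$
\hat{\pi}_T = \hat{\pi}_T^{(0)} + \varepsilon \hat{\pi}_T^{(1)} + \varepsilon^2 \hat{\pi}_T^{(2)} + \dots
$$

From (2) we obtain

$$
\begin{aligned}
\hat{\pi}_T &= \bar{\pi}_T - \bar{\pi}_T \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1}
+ \mathbf{1}^{\top} X_0 \frac{1}{n_T} (\bar{\pi}_T \varepsilon D \mathbf{1}) + o(\varepsilon) \\
&= \bar{\pi}_T + \varepsilon \Big( \mathbf{1}^{\top} X_0 \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1})
- \bar{\pi}_T \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1} \Big) + o(\varepsilon),
\end{aligned}
$$

and hence

$$
\hat{\pi}_T^{(1)} = \mathbf{1}^{\top} X_0 \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1})
- \bar{\pi}_T \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1},
\qquad (3)
$$

where $X_0$ is given by (30). Before substituting (30) into (3), let us make the transformations

$$
X_0 = (I - X_{-1} D) H (I - D X_{-1})
= H - H D X_{-1} - X_{-1} D H + X_{-1} D H D X_{-1},
$$

where $X_{-1}$ is defined by (29). Pre-multiplying $X_0$ by $\mathbf{1}^{\top}$, we obtain

$$
\begin{aligned}
\mathbf{1}^{\top} X_0 = \mathbf{1}^{\top} H
&- \bar{\pi}_T (\mathbf{1}^{\top} H D \mathbf{1}) (\bar{\pi}_T D \mathbf{1})^{-1}
- n_T (\bar{\pi}_T D \mathbf{1})^{-1} \bar{\pi}_T D H \\
&+ n_T (\bar{\pi}_T D H D \mathbf{1})\, \bar{\pi}_T (\bar{\pi}_T D \mathbf{1})^{-2}.
\end{aligned}
\qquad (4)
$$

Post-multiplying $X_0$ by $\mathbf{1}$ and using $H\mathbf{1} = 0$, we obtain

$$
X_0 \mathbf{1} = X_{-1} D H D X_{-1} \mathbf{1} - H D X_{-1} \mathbf{1},
$$

and hence

$$
\mathbf{1}^{\top} X_0 \mathbf{1} = n_T (\bar{\pi}_T D H D \mathbf{1}) (\bar{\pi}_T D \mathbf{1})^{-2}
- (\mathbf{1}^{\top} H D \mathbf{1}) (\bar{\pi}_T D \mathbf{1})^{-1}.
\qquad (5)
$$

Substituting (4) and (5) into (3), we get

$$
\begin{aligned}
\hat{\pi}_T^{(1)} &= \mathbf{1}^{\top} X_0 \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1})
- \bar{\pi}_T \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1})\, \mathbf{1}^{\top} X_0 \mathbf{1} \\
&= \mathbf{1}^{\top} H \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1})
- \frac{1}{n_T} \bar{\pi}_T (\mathbf{1}^{\top} H D \mathbf{1}) - \bar{\pi}_T D H \\
&\quad + \bar{\pi}_T (\bar{\pi}_T D H D \mathbf{1}) (\bar{\pi}_T D \mathbf{1})^{-1}
- \bar{\pi}_T (\bar{\pi}_T D H D \mathbf{1}) (\bar{\pi}_T D \mathbf{1})^{-1}
+ \frac{1}{n_T} \bar{\pi}_T (\mathbf{1}^{\top} H D \mathbf{1}) \\
&= \mathbf{1}^{\top} H \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1}) - \bar{\pi}_T D H.
\end{aligned}
$$

Thus, we have

$$
\hat{\pi}_T^{(1)} = \mathbf{1}^{\top} H \frac{1}{n_T} (\bar{\pi}_T D \mathbf{1}) - \bar{\pi}_T D H.
$$

□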

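Next, we consider the quasi-stationary distribution [9, 22] defined by the equation

$$
\tilde{\pi}_T T = \lambda_1 \tilde{\pi}_T, \qquad (6)
$$

and the normalization condition

$$
\tilde{\pi}_T \mathbf{1} = 1, \qquad (7)
$$

where $\lambda_1$ is the Perron-Frobenius eigenvalue of the matrix $T$. The quasi-stationary
distribution can be interpreted as a proper initial distribution on the non-absorbing states
(states in ESCC) which is such that the distribution of the random walk, conditioned on
non-absorption prior to time $t$, is independent of $t$ [11]. As in the analysis of the
pseudo-stationary distribution, we take the matrix $T$ in the form of the perturbation
$T = \bar{T} - \varepsilon D$.

Proposition 4 The vector $\tilde{\pi}_T$ is close to the vector $\bar{\pi}_T$. Namely,

$$
\tilde{\pi}_T = \bar{\pi}_T - \varepsilon \bar{\pi}_T D H + o(\varepsilon).
$$

Proof: We look for the quasi-stationary distribution and the Perron-Frobenius eigenvalue in the
form of power series

$$
\tilde{\pi}_T = \tilde{\pi}_T^{(0)} + \varepsilon \tilde{\pi}_T^{(1)} + \varepsilon^2 \tilde{\pi}_T^{(2)} + \dots, \qquad (8)
$$

$$
\lambda_1 = 1 + \varepsilon \lambda_1^{(1)} + \varepsilon^2 \lambda_1^{(2)} + \dots
$$

Substituting $T = \bar{T} - \varepsilon D$ and the above series into (6), and equating terms with
the same powers of $\varepsilon$, we obtain

$$
\tilde{\pi}_T^{(0)} \bar{T} = \tilde{\pi}_T^{(0)}, \qquad (9)
$$

$$
\tilde{\pi}_T^{(1)} \bar{T} - \tilde{\pi}_T^{(0)} D = \tilde{\pi}_T^{(1)} + \lambda_1^{(1)} \tilde{\pi}_T^{(0)}. \qquad (10)
$$

Substituting (8) into the normalization condition (7), we get

$$
\tilde{\pi}_T^{(0)} \mathbf{1} = 1, \qquad (11)
$$

$$
\tilde{\pi}_T^{(1)} \mathbf{1} = 0. \qquad (12)
$$

From (9) and (11) we conclude that $\tilde{\pi}_T^{(0)} = \bar{\pi}_T$. Thus, equation (10) takes
the form

$$
\tilde{\pi}_T^{(1)} \bar{T} - \bar{\pi}_T D = \tilde{\pi}_T^{(1)} + \lambda_1^{(1)} \bar{\pi}_T.
$$

Post-multiplying this equation by $\mathbf{1}$, we get

$$
\tilde{\pi}_T^{(1)} \bar{T} \mathbf{1} - \bar{\pi}_T D \mathbf{1}
= \tilde{\pi}_T^{(1)} \mathbf{1} + \lambda_1^{(1)} \bar{\pi}_T \mathbf{1}.
$$

Now using $\bar{T}\mathbf{1} = \mathbf{1}$, (11) and (12), we conclude that

$$
\lambda_1^{(1)} = -\bar{\pi}_T D \mathbf{1},
$$

and, consequently,

$$
\lambda_1 = 1 - \varepsilon \bar{\pi}_T D \mathbf{1} + o(\varepsilon). \qquad (13)
$$

Now equation (10) can be rewritten as follows:

$$
\tilde{\pi}_T^{(1)} [I - \bar{T}] = \bar{\pi}_T \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big].
$$

Its general solution is given by

$$
\tilde{\pi}_T^{(1)} = \nu \bar{\pi}_T + \bar{\pi}_T \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big] H,
$$

where $\nu$ is some constant. To find the constant $\nu$, we substitute the above general solution
into condition (12):

$$
\tilde{\pi}_T^{(1)} \mathbf{1} = \nu \bar{\pi}_T \mathbf{1}
+ \bar{\pi}_T \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big] H \mathbf{1} = 0.
$$

Since $\bar{\pi}_T \mathbf{1} = 1$ and $H\mathbf{1} = 0$, we get $\nu = 0$. Consequently, we have

$$
\tilde{\pi}_T^{(1)} = \bar{\pi}_T \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big] H
= (\bar{\pi}_T D \mathbf{1}) \bar{\pi}_T H - \bar{\pi}_T D H = -\bar{\pi}_T D H.
$$

In the above, we have used the fact that $\bar{\pi}_T H = 0$. This completes the proof.

□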

Since $\lambda_1$ is very close to one, we conclude from (13) and the equality
$\varepsilon \bar{\pi}_T D \mathbf{1} = \bar{\pi}_T R \mathbf{1}$ that indeed
$\bar{\pi}_T R \mathbf{1}$ is typically very small.

There is also a simple relation between $\lambda_1$ and $\tilde{\pi}_T$.

Proposition 5 The Perron-Frobenius eigenvalue $\lambda_1$ of the matrix $T$ is given by

$$
\lambda_1 = 1 - \tilde{\pi}_T R \mathbf{1}. \qquad (14)
$$

Proof: Post-multiplying equation (6) by $\mathbf{1}$, we obtain

$$
\lambda_1 = \tilde{\pi}_T T \mathbf{1}.
$$

Then, using the fact that $T\mathbf{1} = \mathbf{1} - R\mathbf{1}$, we derive formula (14).

□

Proposition 5 indicates that if $\lambda_1$ is close to one then $\tilde{\pi}_T R \mathbf{1}$ is small.
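
The quasi-stationary distribution of equations (6) and (7) is a left Perron-Frobenius eigenvector, so it can be approximated by a normalized power iteration. The sketch below is one way to do this under the assumption that the ESCC block is irreducible and aperiodic; the example matrix and the function name `quasi_stationary` are hypothetical. It also evaluates both sides of relation (14), using the fact that the one-step exit probabilities $R\mathbf{1}$ equal $\mathbf{1} - T\mathbf{1}$.

```python
def quasi_stationary(T, tol=1e-13, max_iter=100_000):
    """Left Perron vector (summing to one) and Perron-Frobenius eigenvalue of T."""
    n = len(T)
    pi = [1.0 / n] * n
    lam = 1.0
    for _ in range(max_iter):
        y = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        lam = sum(y)                   # pi T 1 estimates lambda_1 because pi sums to one
        new = [v / lam for v in y]
        step = max(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if step < tol:
            break
    return pi, lam


# Hypothetical 3-node ESCC block: node 1 leaks probability 0.1 to POUT.
T = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.4],
     [0.5, 0.5, 0.0]]
pi_tilde, lam1 = quasi_stationary(T)

# One-step exit probability from the quasi-stationary distribution, i.e. pi R 1.
exit_prob = sum(pi_tilde[i] * (1.0 - sum(T[i])) for i in range(len(T)))
```

Up to the iteration tolerance, `lam1` and `1 - exit_prob` agree, as relation (14) predicts.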

As we mentioned above, the entry $\bar{T}_{ij}$ of the matrix $\bar{T}$ can be considered as the
conditional probability to jump from node $i$ to node $j$ under the condition that the random walk
does not leave ESCC at the jump.

Let us consider the situation when the random walk stays inside ESCC during some finite number
of jumps. The probability of the first jump, conditioned on such an event, can be expressed as
follows:

$$
P\Big( X_1 = j \,\Big|\, X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \Big),
$$

where ESCC is denoted by $S$ to shorten the notation and $N$ is the number of jumps
during which the random walk stays in ESCC.

Let us denote by $T^{(N)}_{ij}$ the elements of $T^N$ (the $N$-th power of $T$) and by
$T^{(N)}_i$ the $i$-th row of the matrix $T^N$. Then

$$
T^{(N)}_i = (T^N)_i = (T\, T^{N-1})_i = T_i T^{N-1}.
$$

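Proposition 6

$$
P\Big( X_1 = j \,\Big|\, X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \Big)
= \frac{T_{ij}\, T^{(N-1)}_j \mathbf{1}}{T^{(N)}_i \mathbf{1}}. \qquad (15)
$$

Proof: see Appendix.

Then, if we denote

$$
\bar{T}^{(N)}_{ij} = P\Big( X_1 = j \,\Big|\, X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \Big),
$$

we will be able to find stationary distributions of $\bar{T}^{(N)}$, which can be viewed as
generalizations of $\bar{\pi}_T$. Let us now consider the limiting case, when $N$ goes to infinity.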
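
To make formula (15) concrete, the following sketch evaluates the conditioned kernel for a finite horizon $N \ge 1$ on a small hypothetical ESCC block. Only matrix-vector products are needed, because $T^{(N)}_i \mathbf{1}$ is the $i$-th component of $T^N \mathbf{1}$. The matrix and the function names are assumptions made for this illustration.

```python
def mat_vec(T, v):
    """Matrix-vector product T v."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]


def conditioned_kernel(T, N):
    """Matrix with entries T_ij (T^{N-1} 1)_j / (T^N 1)_i, as in equation (15); N >= 1."""
    n = len(T)
    prev = [1.0] * n                   # T^0 1
    for _ in range(N - 1):
        prev = mat_vec(T, prev)        # after the loop: T^{N-1} 1
    cur = mat_vec(T, prev)             # T^N 1
    return [[T[i][j] * prev[j] / cur[i] for j in range(n)] for i in range(n)]


# Hypothetical 3-node ESCC block: node 1 leaks probability 0.1 to POUT.
T = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.4],
     [0.5, 0.5, 0.0]]
K1 = conditioned_kernel(T, 1)          # N = 1: plain row normalisation of T
K5 = conditioned_kernel(T, 5)
```

For $N = 1$ the kernel reduces to the row-normalized matrix, and for every $N$ its rows sum to one because $\sum_j T_{ij} (T^{N-1}\mathbf{1})_j = (T^N \mathbf{1})_i$.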

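Before we continue, let us analyze the principal right eigenvector $u$ of the matrix $T$:

$$
T u = \lambda_1 u, \qquad (16)
$$

where $\lambda_1$ is, as before, the Perron-Frobenius eigenvalue.

The vector $u$ can be normalized in different ways. Let us define the main normalization for $u$
as

$$
\mathbf{1}^{\top} u = n_T.
$$

Let us also define $\bar{u}$ as

$$
\bar{u} = \frac{u}{\bar{\pi}_T u}, \quad \text{so that } \bar{\pi}_T \bar{u} = 1, \qquad (17)
$$

and $\tilde{u}$ as

$$
\tilde{u} = \frac{u}{\tilde{\pi}_T u}, \quad \text{so that } \tilde{\pi}_T \tilde{u} = 1. \qquad (18)
$$

Proposition 7 The vector $\bar{u}$ is close to the vector $\mathbf{1}$. Namely,

$$
\bar{u} = \mathbf{1} - \varepsilon H D \mathbf{1} + o(\varepsilon).
$$

Proof: We look for the right eigenvector and the Perron-Frobenius eigenvalue in the form of power
series

$$
\bar{u} = \bar{u}^{(0)} + \varepsilon \bar{u}^{(1)} + \varepsilon^2 \bar{u}^{(2)} + \dots, \qquad (19)
$$

$$
\lambda_1 = 1 + \varepsilon \lambda_1^{(1)} + \varepsilon^2 \lambda_1^{(2)} + \dots
$$

Substituting $T = \bar{T} - \varepsilon D$ and the above series into (16), and equating terms with
the same powers of $\varepsilon$, we obtain

$$
\bar{T} \bar{u}^{(0)} = \bar{u}^{(0)}, \qquad (20)
$$

$$
\bar{T} \bar{u}^{(1)} - D \bar{u}^{(0)} = \bar{u}^{(1)} + \lambda_1^{(1)} \bar{u}^{(0)}. \qquad (21)
$$

Substituting (19) into the normalization condition (17), we obtain

$$
\bar{\pi}_T \bar{u}^{(0)} = 1, \qquad (22)
$$

$$
\bar{\pi}_T \bar{u}^{(1)} = 0. \qquad (23)
$$

From (20) and (22) we conclude that $\bar{u}^{(0)} = \mathbf{1}$. Thus, equation (21) takes the form

$$
\bar{T} \bar{u}^{(1)} - D \mathbf{1} = \bar{u}^{(1)} + \lambda_1^{(1)} \mathbf{1}.
$$

Pre-multiplying this equation by $\bar{\pi}_T$, we get

$$
\bar{\pi}_T \bar{T} \bar{u}^{(1)} - \bar{\pi}_T D \mathbf{1}
= \bar{\pi}_T \bar{u}^{(1)} + \lambda_1^{(1)} \bar{\pi}_T \mathbf{1}.
$$

Now using $\bar{\pi}_T \bar{T} = \bar{\pi}_T$, $\bar{\pi}_T \mathbf{1} = 1$ and (23), we conclude that

$$
\lambda_1^{(1)} = -\bar{\pi}_T D \mathbf{1},
$$

and, consequently,

$$
\lambda_1 = 1 - \varepsilon \bar{\pi}_T D \mathbf{1} + o(\varepsilon).
$$

Now equation (21) can be rewritten as follows:

$$
[I - \bar{T}]\, \bar{u}^{(1)} = \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big] \mathbf{1}.
$$

Its general solution is given by

$$
\bar{u}^{(1)} = \nu \mathbf{1} + H \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big] \mathbf{1},
$$

where $\nu$ is some constant. To find the constant $\nu$, we substitute the above general solution
into condition (23):

$$
\bar{\pi}_T \bar{u}^{(1)} = \nu \bar{\pi}_T \mathbf{1}
+ \bar{\pi}_T H \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big] \mathbf{1} = 0.
$$

Since $\bar{\pi}_T \mathbf{1} = 1$ and $\bar{\pi}_T H = 0$, we get $\nu = 0$. Consequently, we have

$$
\bar{u}^{(1)} = H \big[ (\bar{\pi}_T D \mathbf{1}) I - D \big] \mathbf{1}
= (\bar{\pi}_T D \mathbf{1}) H \mathbf{1} - H D \mathbf{1} = -H D \mathbf{1}.
$$

In the above, we have used the fact that $H\mathbf{1} = 0$. This completes the proof.

□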

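We note that the elements of the vector $u$ can be calculated by the power iteration method.

Proposition 8 The following convergence takes place

$$
\frac{T_i T^{N-1} \mathbf{1}}{\lambda_1^N} \to \tilde{u}_i \quad \text{as } N \to \infty, \qquad (24)
$$

where $T_i$ is the $i$-th row of the matrix $T$.

Proof: Consider the power iteration started from the vector of ones,

$$
\tilde{u}^{(0)} = \mathbf{1}, \qquad
\tilde{u}^{(1)}_i = \frac{T_i \mathbf{1}}{\lambda_1}, \qquad
\tilde{u}^{(2)}_i = \frac{T_i (T \mathbf{1})}{\lambda_1^2}, \qquad \dots, \qquad
\tilde{u}^{(N)}_i = \frac{T_i T^{N-1} \mathbf{1}}{\lambda_1^N}.
$$

By (6) and (7), every iterate satisfies the normalization (18):

$$
\tilde{\pi}_T \tilde{u}^{(1)} = \frac{\tilde{\pi}_T T \mathbf{1}}{\lambda_1} = \tilde{\pi}_T \mathbf{1} = 1,
\qquad
\tilde{\pi}_T \tilde{u}^{(2)} = \frac{\tilde{\pi}_T T \tilde{u}^{(1)}}{\lambda_1} = \tilde{\pi}_T \tilde{u}^{(1)} = 1,
\qquad \dots
$$

Hence the iterates converge to the right eigenvector normalized as in (18), that is, to $\tilde{u}$.

□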
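
The convergence in Proposition 8 can be observed numerically. The sketch below, which assumes an irreducible and aperiodic ESCC block, first estimates the Perron-Frobenius eigenvalue and the left eigenvector by power iteration and then forms the rescaled row sums $(T^N \mathbf{1})_i / \lambda_1^N$. The example matrix, the iteration counts, and the function names are hypothetical choices for this illustration.

```python
def left_perron(T, iters=500):
    """Left Perron vector (summing to one) and eigenvalue of T by power iteration."""
    n = len(T)
    pi, lam = [1.0 / n] * n, 1.0
    for _ in range(iters):
        y = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        lam = sum(y)
        pi = [v / lam for v in y]
    return pi, lam


def scaled_row_sums(T, lam, N):
    """Vector with components (T^N 1)_i / lam^N, computed by N rescaled products."""
    n = len(T)
    u = [1.0] * n
    for _ in range(N):
        u = [sum(T[i][j] * u[j] for j in range(n)) / lam for i in range(n)]
    return u


# Hypothetical 3-node ESCC block: node 1 leaks probability 0.1 to POUT.
T = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.4],
     [0.5, 0.5, 0.0]]
pi_tilde, lam1 = left_perron(T)
u_tilde = scaled_row_sums(T, lam1, 200)

# Inner product of the left vector with the limit; normalisation (18) requires one.
normalisation = sum(p * v for p, v in zip(pi_tilde, u_tilde))
```

Dividing by the eigenvalue at every step, rather than once at the end, keeps the intermediate vectors at order one and avoids underflow for large $N$.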

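Let us consider the twisted kernel $\check{T}$ defined by

$$
\check{T}_{ij} = \frac{T_{ij} u_j}{\lambda_1 u_i}.
$$

As one can see, the twisted kernel does not depend on the normalization of $u$. Hence, we can take
any normalization.

Proposition 9 The twisted kernel is the limit of (15) as $N$ goes to infinity, that is

$$
\check{T}_{ij} = \lim_{N \to \infty} \frac{T_{ij}\, T^{(N-1)}_j \mathbf{1}}{T^{(N)}_i \mathbf{1}}.
$$

Proof:

$$
\lim_{N \to \infty} \frac{T_{ij}\, T^{(N-1)}_j \mathbf{1}}{T^{(N)}_i \mathbf{1}}
= \frac{T_{ij}}{\lambda_1} \lim_{N \to \infty}
\frac{T^{(N-1)}_j \mathbf{1} / \lambda_1^{N-1}}{T^{(N)}_i \mathbf{1} / \lambda_1^{N}}.
$$

Using (24), we can write

$$
\lim_{N \to \infty} \frac{T_{ij}\, T^{(N-1)}_j \mathbf{1}}{T^{(N)}_i \mathbf{1}}
= \frac{T_{ij} \tilde{u}_j}{\lambda_1 \tilde{u}_i}.
$$

After renormalization, we obtain

$$
\lim_{N \to \infty} \frac{T_{ij}\, T^{(N-1)}_j \mathbf{1}}{T^{(N)}_i \mathbf{1}}
= \frac{T_{ij} u_j}{\lambda_1 u_i} = \check{T}_{ij}.
$$

□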

The twisted kernel plays an important role in multiplicative ergodic theory and large deviations
for Markov chains, see, e.g., [14]. The matrix $\check{T}$ is clearly a transition probability
kernel, i.e., $\check{T}_{ij} \ge 0$ for all $i, j$, and $\sum_j \check{T}_{ij} = 1$ for all $i$.
Also, it is irreducible if there exists a path $i \to j$ under $T$ for all $i, j$, which we assume
to be the case. In particular, it will have a unique stationary distribution $\check{\pi}_T$
associated with it:

$$
\check{\pi}_T = \check{\pi}_T \check{T}, \qquad (25)
$$

$$
\check{\pi}_T \mathbf{1} = 1. \qquad (26)
$$

If we assume aperiodicity in addition, $\check{T}_{ij}$ can be given the interpretation of the
probability of transition from $i$ to $j$ in the ESCC for the chain, conditioned on the fact that
it never leaves the ESCC. Thus, $\check{\pi}_T$ qualifies as an alternative definition of a
quasi-stationary distribution.

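Proposition 10 The following expression for $\check{\pi}_T$ holds:

$$
\check{\pi}_{T,i} = \tilde{\pi}_{T,i}\, \tilde{u}_i. \qquad (27)
$$

Proof: The normalization condition (26) is satisfied due to (18). Let us show that (25) holds as
well, i.e.

$$
\check{\pi}_{T,j} = \sum_{i=1}^{n_T} \check{\pi}_{T,i} \check{T}_{ij},
$$

where $n_T$ is the dimension of $\check{\pi}_T$. Substituting the right-hand side of (27), we have

$$
\sum_{i=1}^{n_T} \tilde{\pi}_{T,i} \tilde{u}_i \frac{T_{ij} \tilde{u}_j}{\lambda_1 \tilde{u}_i}
= \frac{\tilde{u}_j}{\lambda_1} \sum_{i=1}^{n_T} \tilde{\pi}_{T,i} T_{ij}
= \frac{\tilde{u}_j}{\lambda_1} \lambda_1 \tilde{\pi}_{T,j}
= \tilde{\pi}_{T,j} \tilde{u}_j.
$$

□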
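
The statements about the twisted kernel can be checked on a small example. The sketch below (the matrix and function names are hypothetical, and power iteration assumes an irreducible, aperiodic block) computes both Perron-Frobenius eigenvectors, builds the twisted kernel, and forms the candidate stationary vector as the normalized componentwise product of the left and right eigenvectors, which is what formula (27) amounts to for any scaling of the right eigenvector.

```python
def perron(T, iters=500):
    """Left vector (summing to one), right vector and eigenvalue of T by power iteration."""
    n = len(T)
    pi = [1.0 / n] * n                 # left eigenvector estimate
    u = [1.0] * n                      # right eigenvector estimate
    lam = 1.0
    for _ in range(iters):
        y = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        lam = sum(y)
        pi = [v / lam for v in y]
        w = [sum(T[i][j] * u[j] for j in range(n)) for i in range(n)]
        top = max(w)                   # any rescaling works: the kernel ignores it
        u = [v / top for v in w]
    return pi, u, lam


def twisted_kernel(T, u, lam):
    """Matrix with entries T_ij u_j / (lam u_i)."""
    n = len(T)
    return [[T[i][j] * u[j] / (lam * u[i]) for j in range(n)] for i in range(n)]


# Hypothetical 3-node ESCC block: node 1 leaks probability 0.1 to POUT.
T = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.4],
     [0.5, 0.5, 0.0]]
pi_tilde, u, lam1 = perron(T)
T_check = twisted_kernel(T, u, lam1)

# Componentwise product of left and right eigenvectors, normalised to sum one.
weights = [p * v for p, v in zip(pi_tilde, u)]
pi_check = [w / sum(weights) for w in weights]
```

On this example the rows of `T_check` sum to one and `pi_check` is invariant under `T_check` up to rounding, in line with (25) to (27).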

This suggests that $\check{\pi}_{T,i}$, or equivalently $\tilde{\pi}_{T,i} \tilde{u}_i$, may be used
as another alternative centrality measure. Since the substochastic matrix $T$ is close to
stochastic, the vector $\tilde{u}$ will be very close to $\mathbf{1}$. Consequently, the vector
$\check{\pi}_T$ will be close to $\tilde{\pi}_T$ and to $\bar{\pi}_T$ as well. This shows that in
the case when the matrix $T$ is close to a stochastic matrix, all the alternative definitions of
quasi-stationary distribution are quite close to each other. And then, from Proposition 1 we
conclude that the PageRank ranking converges to the quasi-stationarity based ranking as the damping
factor goes to one.



3 Numerical experiments and Applications 

For our numerical experiments we have used the Web site of INRIA (http://www.inria.fr). It
is a typical Web site with about 300 000 Web pages and 2 200 000 hyperlinks. Since the Web
has a fractal structure [10], we expect that our dataset is sufficiently representative. Accordingly,
datasets of similar or even smaller sizes have been extensively used in experimental studies of novel
algorithms for PageRank computation. To collect the Web graph data, we built our
own Web crawler, which works with an Oracle database. The crawler consists of two parts: the
first part is implemented in Java and is responsible for downloading pages from the Internet, parsing
the pages, and inserting their hyperlinks into the database; the second part is written in PL/SQL and
is responsible for data management. For a detailed description of the crawler, the reader is referred
to [3].

As was shown in [15], a Web graph has three major distinct components: IN, OUT and
SCC. However, if one takes into account the artificial links from the dangling nodes, a Web graph
has two major distinct components: POUT and ESCC [5]. In our experiments we take into account the
artificial links from the dangling nodes and compute all four centrality vectors introduced in
Section 2, namely $\hat{\pi}_T$, $\bar{\pi}_T$, $\tilde{\pi}_T$ and $\check{\pi}_T$, with five-digit
precision. We provide the statistics for the INRIA Web site in Table 1.

                              INRIA
Total size                    318585
Number of nodes in SCC
Number of nodes in IN
Number of nodes in OUT        164443
Number of nodes in ESCC       300682
Number of nodes in POUT       17903
Number of SCCs in OUT         1148
Number of SCCs in POUT        631

Table 1: Component sizes in INRIA dataset

For each pair of these vectors we calculated the Kendall Tau metric (see Table 2). The Kendall Tau
metric shows how much two rankings differ in terms of the number of swaps needed to
transform one ranking into the other. The Kendall Tau metric has the value of one if the two
rankings are identical and minus one if one ranking is the inverse of the other.

          π_T        π_T        π_T        π_T
π_T       1.0        0.99390    0.99498    0.98228
π_T                  1.0        0.99770    0.98786
π_T                             1.0        0.98597
π_T                                        1.0

Table 2: Kendall Tau comparison

In our case, the Kendall Tau metric for every pair is very close to one. Thus, we can conclude
that all four quasi-stationarity based centrality measures produce very similar rankings.
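
For reference, a direct implementation of the Kendall Tau metric between two score vectors over the same set of nodes is sketched below. This is the plain pairwise definition without any correction for ties; the function name `kendall_tau` is an assumption for this example, and nothing is implied about how the values in Table 2 were actually computed.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant pairs - discordant pairs) / (total pairs); ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


same = kendall_tau([0.5, 0.3, 0.2], [0.6, 0.3, 0.1])      # identical order
reverse = kendall_tau([0.5, 0.3, 0.2], [0.1, 0.3, 0.6])   # opposite order
one_swap = kendall_tau([1, 2, 3], [1, 3, 2])              # one adjacent swap
```

The double loop costs a number of operations quadratic in the number of nodes, so for a graph with hundreds of thousands of nodes a merge-sort based $O(n \log n)$ variant would be needed in practice.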

We have also analyzed the Kendall Tau metric between $\hat{\pi}_T$ and the PageRank of ESCC as a
function of the damping factor (see Figure 1). As $c$ goes to one, the Kendall Tau approaches one.
This is in agreement with Proposition 1.

Finally, we would like to note that under the quasi-stationarity based centrality measures
the top ranking places were occupied by sites with the internal structure depicted in Figure 2.
Therefore, we suggest using the quasi-stationarity based centrality measures to detect "link farms"
and to discover photo albums. It turns out that the quasi-stationarity based centrality measures
highlight the sites with the structure shown in Figure 2, while at the same time the relative ranking
of the other sites provided by the standard PageRank with $c = 0.85$ is preserved. To illustrate this
fact, we give in Table 3 the rankings of some sites under different centrality measures. Even though
the absolute value of the ranking changes, the relative ranking among these sites is the same for all
centrality measures. This indicates that the quasi-stationarity based centrality measures help to
discover "link farms" and photo albums, while the ranking of sites of other types
stays consistent with the standard PageRank ranking.





Site                            π_T(0.85)   π_T      π_T      π_T      π_T
http://www.inria.fr/            1           31       189      105      200
http://www.loria.fr/            13          310      1605     356      1633
http://www.irisa.fr/            16          432      1696     460      757
http://www-sop.inria.fr/        30          508      1825     532      1819
http://www-rocq.inria.fr/       74          1333     2099     1408     2158
http://www-futurs.inria.fr/     102         2201     2360     2206     2404

Table 3: Examples of sites' rankings

Figure 1: The Kendall Tau metric between $\hat{\pi}_T$ and PageRank of ESCC as a function of the
damping factor.

Figure 2: The album-like Web site structure

4 Conclusion

In the paper we have proposed centrality measures which can be applied to a reducible graph to
avoid the absorption problem. In Google PageRank the problem was solved by introducing
uniform random jumps with some probability. Up to the present, there is no clear criterion for the
choice of this parameter. In the paper we have suggested four quasi-stationarity based
parameter-free centrality measures, analyzed them, and concluded that they produce approximately
the same ranking. Therefore, in practice it is sufficient to compute only one quasi-stationarity
based centrality measure. All our theoretical results are confirmed by numerical experiments. The
numerical experiments have also shown that the new centrality measures can be applied in spam
detection to detect "link farms" and in image search to find photo albums.

Appendix

Here we present a couple of important auxiliary results.

Lemma 1 Let $\bar{T}$ be an irreducible stochastic matrix, and let
$T(\varepsilon) = \bar{T} - \varepsilon D$ be a perturbation of $\bar{T}$ such that $T(\varepsilon)$
is a substochastic matrix. Then, for sufficiently small $\varepsilon$ the following Laurent
series expansion holds

$$
[I - T(\varepsilon)]^{-1} = \frac{1}{\varepsilon} X_{-1} + X_0 + \varepsilon X_1 + \dots, \qquad (28)
$$

with

$$
X_{-1} = \frac{1}{\bar{\pi} D \mathbf{1}}\, \mathbf{1} \bar{\pi}, \qquad (29)
$$

$$
X_0 = (I - X_{-1} D) H (I - D X_{-1}), \qquad (30)
$$

where $\bar{\pi}$ is the stationary distribution of $\bar{T}$ and
$H = (I - \bar{T} + \mathbf{1}\bar{\pi})^{-1} - \mathbf{1}\bar{\pi}$ is the deviation matrix.

Proof: The proof of this result is based on the approach developed in [2, 4]. The existence of the
Laurent series (28) is a particular case of more general results of [4]. To calculate the terms of
the Laurent series, let us equate the terms with the same powers of $\varepsilon$ in the following
identity

$$
(I - \bar{T} + \varepsilon D) \Big( \frac{1}{\varepsilon} X_{-1} + X_0 + \varepsilon X_1 + \dots \Big) = I,
$$

which results in

$$
(I - \bar{T}) X_{-1} = 0, \qquad (31)
$$

$$
(I - \bar{T}) X_0 + D X_{-1} = I, \qquad (32)
$$

$$
(I - \bar{T}) X_1 + D X_0 = 0. \qquad (33)
$$

From equation (31) we conclude that

$$
X_{-1} = \mathbf{1} \mu_{-1}, \qquad (34)
$$

where $\mu_{-1}$ is some row vector. We find this vector from the condition that equation (32) has
a solution. In particular, equation (32) has a solution if and only if

$$
\bar{\pi} (I - D X_{-1}) = 0.
$$

By substituting into the above equation the expression (34), we obtain

$$
\bar{\pi} - \bar{\pi} D \mathbf{1} \mu_{-1} = 0,
$$

and, consequently,

$$
\mu_{-1} = \frac{1}{\bar{\pi} D \mathbf{1}}\, \bar{\pi},
$$

which together with (34) gives (29).

Since the deviation matrix $H$ is a generalized inverse (the group inverse) of $I - \bar{T}$, the
general solution of equation (32) with respect to $X_0$ is given by

$$
X_0 = H (I - D X_{-1}) + \mathbf{1} \mu_0, \qquad (35)
$$

where $\mu_0$ is some row vector. The vector $\mu_0$ can be found from the condition that equation
(33) has a solution. In particular, equation (33) has a solution if and only if

$$
\bar{\pi} D X_0 = 0.
$$

By substituting into the above equation the expression for the general solution (35), we obtain

$$
\bar{\pi} D H (I - D X_{-1}) + \bar{\pi} D \mathbf{1} \mu_0 = 0.
$$

Consequently, we have

$$
\mu_0 = -\frac{1}{\bar{\pi} D \mathbf{1}}\, \bar{\pi} D H (I - D X_{-1}),
$$

and we obtain (30).

□
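
The expansion of Lemma 1 can be checked on a two-state example where every quantity is available in closed form. In the sketch below the stochastic matrix `T_bar`, the perturbation direction `D` and the value of `eps` are hypothetical; the code evaluates (29) and (30) and compares the first two terms of the Laurent series with the exact inverse.

```python
def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]


def sub(A, B):
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def inv2(A):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]


I2 = [[1.0, 0.0], [0.0, 1.0]]
T_bar = [[0.5, 0.5], [0.2, 0.8]]       # hypothetical irreducible stochastic matrix
D = [[1.0, 0.0], [0.0, 0.0]]           # hypothetical perturbation direction
pi = [2.0 / 7.0, 5.0 / 7.0]            # stationary distribution of T_bar
ones_pi = [pi[:], pi[:]]               # rank-one matrix 1 pi

pi_D_1 = sum(pi[i] * sum(D[i]) for i in range(2))             # scalar pi D 1
X_m1 = [[p / pi_D_1 for p in pi] for _ in range(2)]           # equation (29)
H = sub(inv2(add(sub(I2, T_bar), ones_pi)), ones_pi)          # deviation matrix
X_0 = matmul(matmul(sub(I2, matmul(X_m1, D)), H),
             sub(I2, matmul(D, X_m1)))                        # equation (30)

eps = 1e-3
T_eps = [[T_bar[i][j] - eps * D[i][j] for j in range(2)] for i in range(2)]
exact = inv2(sub(I2, T_eps))
approx = [[X_m1[i][j] / eps + X_0[i][j] for j in range(2)] for i in range(2)]
error = max(abs(exact[i][j] - approx[i][j]) for i in range(2) for j in range(2))
```

For this particular choice the two leading terms already reproduce the inverse up to rounding, since here $X_0$ has a single nonzero entry equal to 5 and the higher-order terms vanish; in general the difference is of order $\varepsilon$.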



Proposition 11

$$
P\Big( X_1 = j \,\Big|\, X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \Big)
= \frac{T_{ij}\, T^{(N-1)}_j \mathbf{1}}{T^{(N)}_i \mathbf{1}}.
$$

Proof: By the definition of conditional probability,

$$
P\Big( X_1 = j \,\Big|\, X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \Big)
= \frac{P\big( X_0 = i \wedge X_1 = j \wedge \bigwedge_{m=2}^{N} X_m \in S \big)}
       {P\big( X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \big)}.
$$

Denominator: Writing the event $X_m \in S$ as $\bigvee_{k_m \in S} X_m = k_m$ and applying the Markov
property one step at a time, we obtain

$$
\begin{aligned}
P\Big( X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \Big)
&= P\Big( X_0 = i \wedge \bigwedge_{m=1}^{N} \bigvee_{k_m \in S} X_m = k_m \Big) \\
&= P(X_0 = i) \sum_{k_1 \in S} P(X_1 = k_1 \mid X_0 = i)\,
   P\Big( \bigwedge_{m=2}^{N} \bigvee_{k_m \in S} X_m = k_m \,\Big|\, X_1 = k_1 \Big) \\
&= P(X_0 = i) \sum_{k_1 \in S} \sum_{k_2 \in S} P(X_1 = k_1 \mid X_0 = i)\, P(X_2 = k_2 \mid X_1 = k_1)\,
   P\Big( \bigwedge_{m=3}^{N} \bigvee_{k_m \in S} X_m = k_m \,\Big|\, X_2 = k_2 \Big) \\
&= P(X_0 = i) \sum_{k_2 \in S} T^{(2)}_{i k_2}\,
   P\Big( \bigwedge_{m=3}^{N} \bigvee_{k_m \in S} X_m = k_m \,\Big|\, X_2 = k_2 \Big) \\
&= \dots \\
&= P(X_0 = i) \sum_{k_{N-1} \in S} T^{(N-1)}_{i k_{N-1}} \sum_{k_N \in S} P(X_N = k_N \mid X_{N-1} = k_{N-1}) \\
&= P(X_0 = i) \sum_{k_N = 1}^{n_T} T^{(N)}_{i k_N}
 = T^{(N)}_i \mathbf{1}\, P(X_0 = i).
\end{aligned}
$$

Thus,

$$
P\Big( X_0 = i \wedge \bigwedge_{m=1}^{N} X_m \in S \Big) = T^{(N)}_i \mathbf{1}\, P(X_0 = i).
$$

Numerator: In the same way,

$$
\begin{aligned}
P\Big( X_0 = i \wedge X_1 = j \wedge \bigwedge_{m=2}^{N} X_m \in S \Big)
&= P\Big( \bigwedge_{m=2}^{N} \bigvee_{k_m \in S} X_m = k_m \,\Big|\, X_1 = j \Big)\,
   P(X_1 = j \mid X_0 = i)\, P(X_0 = i) \\
&= T_{ij}\, T^{(N-1)}_j \mathbf{1}\, P(X_0 = i).
\end{aligned}
$$

Dividing the numerator by the denominator gives the statement.

□